Field of study: Computer Science - Data Science
Author of solutions: Bartłomiej Jamiołkowski
from __future__ import print_function

import numpy as np
import pandas as pd
import sklearn
import sklearn.datasets
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing
import lime
import lime.lime_tabular

np.random.seed(1)
We will analyse a dataset that has both numerical and categorical features. Here, the task is to predict whether a person makes over 50K dollars per year.
feature_names = ["Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Marital Status", "Occupation", "Relationship", "Race", "Sex",
"Capital Gain", "Capital Loss","Hours per week", "Country"]
data = np.genfromtxt('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', delimiter=', ', dtype=str)
Take a look at the data. Let's analyse the labels:
labels = data[:, 14]
np.unique(labels)
array(['<=50K', '>50K'], dtype='<U26')
In order to use the labels, we need to encode them as discrete numerical values:
le = sklearn.preprocessing.LabelEncoder()
le.fit(labels)
labels = le.transform(labels)
class_names = le.classes_
np.unique(labels)
array([0, 1])
Let's remove labels from data and take a look at the data:
data = data[:, :-1]
pd.DataFrame(data, columns=feature_names)
| | Age | Workclass | fnlwgt | Education | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States |
| 32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States |
32561 rows × 14 columns
The dataset has many categorical features, which we need to preprocess like we did with the labels before - our explainer (and most classifiers) takes in numerical data, even if the features are categorical. We thus transform all of the string attributes into integers, using sklearn's LabelEncoder. We use a dict to save the correspondence between the integer values and the original strings, so that we can present this later in the explanations.
categorical_features = [1, 3, 5, 6, 7, 8, 9, 13]
categorical_names = {}
for feature in categorical_features:
    le = sklearn.preprocessing.LabelEncoder()
    le.fit(data[:, feature])
    data[:, feature] = le.transform(data[:, feature])
    categorical_names[feature] = le.classes_
data = data.astype(float)
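As a toy illustration (using a hypothetical column standing in for Workclass), LabelEncoder assigns integer codes in alphabetical order of the category strings, and classes_ recovers the original strings - the same mapping that categorical_names stores per feature:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for e.g. the Workclass feature.
column = np.array(['Private', 'State-gov', 'Private', 'Self-emp-inc'])

le = LabelEncoder()
codes = le.fit_transform(column)
print(codes.tolist())  # [0, 2, 0, 1] -- classes are sorted alphabetically

# classes_ maps the integer codes back to the original strings,
# which is what categorical_names is used for in the explanations.
print(le.classes_[codes].tolist())  # ['Private', 'State-gov', 'Private', 'Self-emp-inc']
```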
Final look at the preprocessed data:
pd.DataFrame(data, columns=feature_names)
| | Age | Workclass | fnlwgt | Education | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.0 | 7.0 | 77516.0 | 9.0 | 13.0 | 4.0 | 1.0 | 1.0 | 4.0 | 1.0 | 2174.0 | 0.0 | 40.0 | 39.0 |
| 1 | 50.0 | 6.0 | 83311.0 | 9.0 | 13.0 | 2.0 | 4.0 | 0.0 | 4.0 | 1.0 | 0.0 | 0.0 | 13.0 | 39.0 |
| 2 | 38.0 | 4.0 | 215646.0 | 11.0 | 9.0 | 0.0 | 6.0 | 1.0 | 4.0 | 1.0 | 0.0 | 0.0 | 40.0 | 39.0 |
| 3 | 53.0 | 4.0 | 234721.0 | 1.0 | 7.0 | 2.0 | 6.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 40.0 | 39.0 |
| 4 | 28.0 | 4.0 | 338409.0 | 9.0 | 13.0 | 2.0 | 10.0 | 5.0 | 2.0 | 0.0 | 0.0 | 0.0 | 40.0 | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27.0 | 4.0 | 257302.0 | 7.0 | 12.0 | 2.0 | 13.0 | 5.0 | 4.0 | 0.0 | 0.0 | 0.0 | 38.0 | 39.0 |
| 32557 | 40.0 | 4.0 | 154374.0 | 11.0 | 9.0 | 2.0 | 7.0 | 0.0 | 4.0 | 1.0 | 0.0 | 0.0 | 40.0 | 39.0 |
| 32558 | 58.0 | 4.0 | 151910.0 | 11.0 | 9.0 | 6.0 | 1.0 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 40.0 | 39.0 |
| 32559 | 22.0 | 4.0 | 201490.0 | 11.0 | 9.0 | 4.0 | 1.0 | 3.0 | 4.0 | 1.0 | 0.0 | 0.0 | 20.0 | 39.0 |
| 32560 | 52.0 | 5.0 | 287927.0 | 11.0 | 9.0 | 2.0 | 4.0 | 5.0 | 4.0 | 0.0 | 15024.0 | 0.0 | 40.0 | 39.0 |
32561 rows × 14 columns
pd.DataFrame(data, columns=feature_names).describe()
| | Age | Workclass | fnlwgt | Education | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 32561.000000 | 32561.000000 | 3.256100e+04 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 | 32561.000000 |
| mean | 38.581647 | 3.868892 | 1.897784e+05 | 10.298210 | 10.080679 | 2.611836 | 6.572740 | 1.446362 | 3.665858 | 0.669205 | 1077.648844 | 87.303830 | 40.437456 | 36.718866 |
| std | 13.640433 | 1.455960 | 1.055500e+05 | 3.870264 | 2.572720 | 1.506222 | 4.228857 | 1.606771 | 0.848806 | 0.470506 | 7385.292085 | 402.960219 | 12.347429 | 7.823782 |
| min | 17.000000 | 0.000000 | 1.228500e+04 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | 28.000000 | 4.000000 | 1.178270e+05 | 9.000000 | 9.000000 | 2.000000 | 3.000000 | 0.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 40.000000 | 39.000000 |
| 50% | 37.000000 | 4.000000 | 1.783560e+05 | 11.000000 | 10.000000 | 2.000000 | 7.000000 | 1.000000 | 4.000000 | 1.000000 | 0.000000 | 0.000000 | 40.000000 | 39.000000 |
| 75% | 48.000000 | 4.000000 | 2.370510e+05 | 12.000000 | 12.000000 | 4.000000 | 10.000000 | 3.000000 | 4.000000 | 1.000000 | 0.000000 | 0.000000 | 45.000000 | 39.000000 |
| max | 90.000000 | 8.000000 | 1.484705e+06 | 15.000000 | 16.000000 | 6.000000 | 14.000000 | 5.000000 | 4.000000 | 1.000000 | 99999.000000 | 4356.000000 | 99.000000 | 41.000000 |
As we can see, the categorical features now hold integer codes in place of their category names.
We now split the data into training and testing:
train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(data, labels, train_size=0.80)
Finally, we use a One-hot encoder, so that the classifier does not take our categorical features as continuous features. We will use this encoder only for the classifier, not for the explainer - and the reason is that the explainer must make sure that a categorical feature only has one value set to True.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
encoder = ColumnTransformer([("OneHot", OneHotEncoder(), categorical_features)], remainder='passthrough')
encoder.fit(data)
encoded_train = encoder.transform(train)
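A minimal sketch on toy data (two columns, with column 0 treated as categorical) of what this encoder produces: each categorical feature becomes a group of 0/1 columns with exactly one 1 per row, so its codes can no longer be read as ordered numbers:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data: column 0 holds category codes, column 1 is numerical.
X = np.array([[0.0, 1.5],
              [1.0, 2.5],
              [2.0, 3.5]])

enc = ColumnTransformer([("OneHot", OneHotEncoder(), [0])],
                        remainder='passthrough')
encoded = enc.fit_transform(X)
# Depending on the sklearn version and data density the result may be sparse.
encoded = np.asarray(encoded.todense()) if hasattr(encoded, 'todense') else encoded

print(encoded.shape)  # (3, 4): three one-hot columns plus the numerical one
# Each row has exactly one 1 among the one-hot columns.
print(encoded[:, :3].sum(axis=1).tolist())  # [1.0, 1.0, 1.0]
```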
We will use gradient boosted trees as the model, using the xgboost package.
import xgboost
gbtree = xgboost.XGBClassifier(n_estimators=300, max_depth=5)
gbtree.fit(encoded_train, labels_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=5, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=300, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
sklearn.metrics.accuracy_score(labels_test, gbtree.predict(encoder.transform(test)))
0.868417012129587
Our predict function, which first transforms the data into the one-hot representation:
predict_fn = lambda x: gbtree.predict_proba(encoder.transform(x)).astype(float)
Tabular explainers need a training set. The reason is that they compute statistics on each feature (column). If the feature is numerical, they compute the mean and std, and discretize it into quartiles. If the feature is categorical, they compute the frequency of each value.
We use these computed statistics for two things:
1. To scale the data, so that distances can be computed meaningfully when features are not on the same scale.
2. To sample perturbed instances, which is done by sampling from a Normal(0, 1), multiplying by the std and adding back the mean.
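A minimal sketch (on a hypothetical toy training set with two numerical features) of how such per-feature statistics drive the perturbation sampling, in the spirit of LIME's tabular sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy training set: two numerical features.
train_toy = np.array([[1.0, 10.0],
                      [2.0, 20.0],
                      [3.0, 30.0],
                      [4.0, 40.0]])

# Per-feature statistics, as a tabular explainer computes them.
means = train_toy.mean(axis=0)
stds = train_toy.std(axis=0)

# Perturbed instances: sample from Normal(0, 1), multiply by the std
# and add back the mean -- the sampling scheme described above.
perturbed = rng.standard_normal(size=(5, 2)) * stds + means

print(means.tolist())   # [2.5, 25.0]
print(perturbed.shape)  # (5, 2)
```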
We now create our explainer. The categorical_features parameter tells it which features are categorical. The categorical_names parameter maps each categorical feature's integer values to their string representations.
explainer = lime.lime_tabular.LimeTabularExplainer(train, feature_names=feature_names, class_names=class_names,
categorical_features=categorical_features,
categorical_names=categorical_names, kernel_width=3, verbose=True)
We now show a few explanations with verbose set to True.
np.random.seed(1)
i = 1653
exp = explainer.explain_instance(test[i], predict_fn, num_features=5)
exp.show_in_notebook(show_all=False)
Intercept -0.004040331593255508 Prediction_local [1.03542247] Right: 0.9999810457229614
exp.as_list()
[('Capital Gain > 0.00', 0.7137120588369416),
('Marital Status=Married-civ-spouse', 0.10496853816074198),
('Education-Num > 12.00', 0.08327993642177241),
('Hours per week > 45.00', 0.07942690425041131),
('Age > 48.00', 0.058075367070340785)]
First, note that the explained row is displayed on the right side, in table format. Since we set the show_all parameter to False, only the features used in the explanation are displayed.
The "value" column displays the original value of each feature.
The explanations for categorical features are based not only on the features themselves, but on feature-value pairs.
LIME has discretized the numerical features in the explanation. This is because we left discretize_continuous=True in the constructor (the default). Discretized features make for more intuitive explanations.
As for the values displayed after setting the "verbose" parameter to True: Intercept is the intercept of the linear model used inside the LIME algorithm. Prediction_local is the prediction of this model for the instance of interest, and Right is the xgboost model's prediction for the same instance. We analyse the weights with respect to the intercept.
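These three numbers satisfy a simple identity: the surrogate model is linear in the binarized features (all equal to 1 for this instance), so the intercept plus the sum of the weights reproduces Prediction_local. A quick sanity check with the values printed above (copied by hand, not part of the LIME API):

```python
# Values copied from the verbose output and as_list() for instance i = 1653.
intercept = -0.004040331593255508
weights = [0.7137120588369416,   # Capital Gain > 0.00
           0.10496853816074198,  # Marital Status=Married-civ-spouse
           0.08327993642177241,  # Education-Num > 12.00
           0.07942690425041131,  # Hours per week > 45.00
           0.058075367070340785] # Age > 48.00
prediction_local = 1.03542247

# The local surrogate is linear, so its prediction decomposes as
# intercept + sum of the weights of the active binarized features.
reconstructed = intercept + sum(weights)
print(round(reconstructed, 4))  # 1.0354
```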
Note that capital gain has very high weight. This makes sense. Now let's see an example where the prediction is different:
i = 92
exp = explainer.explain_instance(test[i], predict_fn, num_features=5)
exp.show_in_notebook(show_all=False)
Intercept 0.8073078235855727 Prediction_local [0.09965143] Right: 0.07562913000583649
Let's also analyse one "weird" example. Take a look at the explanations:
i = 18
exp = explainer.explain_instance(test[i], predict_fn, num_features=5)
exp.show_in_notebook(show_all=False)
Intercept 0.12538697704831345 Prediction_local [0.82148369] Right: 0.005185970105230808
We see that the model predicted "<=50K" even though the explanation suggests ">50K". Let's do the analysis:
intercept + sum(weights) = prediction_local, which is the linear surrogate's prediction for the label ">50K" (about 0.82). The xgboost prediction is, however, 0.005, so in this case the local linear model approximates the classifier poorly and the LIME explanation is not reliable.
Choose a different model for preparing predictions (you may want to use sklearn models).
I will use a random forest model from the scikit-learn package.
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=300, max_depth=5)
forest.fit(encoded_train, labels_train)
RandomForestClassifier(max_depth=5, n_estimators=300)
sklearn.metrics.accuracy_score(labels_test, forest.predict(encoder.transform(test)))
0.841087056655919
A predict function, which first transforms the data into the one-hot representation:
predict_fn_rf = lambda x: forest.predict_proba(encoder.transform(x)).astype(float)
Prepare an explainer.
explainer_rf = lime.lime_tabular.LimeTabularExplainer(train, feature_names=feature_names, class_names=class_names,
categorical_features=categorical_features,
categorical_names=categorical_names, kernel_width=3, verbose=True)
Use it on the three instances explained in the tutorial. Did the explanations change? Analyse if the explanations make sense.
i = 1653
exp = explainer_rf.explain_instance(test[i], predict_fn_rf, num_features=5)
Intercept 0.08437412275782188 Prediction_local [0.63138254] Right: 0.7662074482400795
exp.show_in_notebook(show_all=False)
exp.as_list()
[('Capital Gain > 0.00', 0.21309153939148034),
('Marital Status=Married-civ-spouse', 0.14422292539260886),
('Relationship=Husband', 0.0829552633376927),
('Education-Num > 12.00', 0.0776684225849026),
('Sex=Male', 0.029070262799571558)]
When I switched from xgboost to the RandomForestClassifier, the explanations changed notably. The random forest's local prediction was about 63% for earning more than $50K, while its actual prediction was 76%. These probabilities are more moderate than xgboost's, which approached 100% and may indicate overfitting. Both models identified capital gain as the key factor, but the random forest assigned it a much lower weight, consistent with forests spreading importance across more features. The random forest also emphasized marital status (being married) and included Relationship=Husband and Sex=Male in its top five, whereas xgboost's top five contained hours per week and age instead. Higher education (Education-Num > 12) contributed positively, with a similar weight, in both models.
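To make the comparison concrete, the two top-5 weight lists (values copied by hand from the outputs above for instance i = 1653) can be aligned by feature name; three features appear in both:

```python
# Weights copied from the xgboost and random forest explanations above.
xgb_exp = {
    'Capital Gain > 0.00': 0.7137120588369416,
    'Marital Status=Married-civ-spouse': 0.10496853816074198,
    'Education-Num > 12.00': 0.08327993642177241,
    'Hours per week > 45.00': 0.07942690425041131,
    'Age > 48.00': 0.058075367070340785,
}
rf_exp = {
    'Capital Gain > 0.00': 0.21309153939148034,
    'Marital Status=Married-civ-spouse': 0.14422292539260886,
    'Relationship=Husband': 0.0829552633376927,
    'Education-Num > 12.00': 0.0776684225849026,
    'Sex=Male': 0.029070262799571558,
}

# Features appearing in both top-5 lists, with their weights side by side.
shared = sorted(set(xgb_exp) & set(rf_exp))
for feature in shared:
    print(f"{feature}: xgboost={xgb_exp[feature]:.3f}, rf={rf_exp[feature]:.3f}")
print(len(shared))  # 3
```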
i = 92
exp = explainer_rf.explain_instance(test[i], predict_fn_rf, num_features=5)
exp.show_in_notebook(show_all=False)
Intercept 0.3799586335900268 Prediction_local [0.3194128] Right: 0.31054536807223104
exp.as_list()
[('Capital Gain <= 0.00', -0.2149779311539793),
('Marital Status=Married-civ-spouse', 0.14155843984900943),
('Relationship=Husband', 0.08398463313836103),
('Education-Num <= 9.00', -0.03694787067103361),
('Hours per week <= 40.00', -0.03416310558140476)]
For this instance as well, the explanations changed after switching to the RandomForestClassifier. The random forest's local prediction was about 32% for earning more than $50K, very close to its actual prediction of 31%; for xgboost both values were much lower, around 8-10%. Both models identified capital gain as a key factor, and here it has a negative impact on the class ">50K" because it equals zero. In both models marital status (married) has a positive impact on ">50K". Education and hours per week appear in both models and have a negative impact on ">50K".
i = 18
exp = explainer_rf.explain_instance(test[i], predict_fn_rf, num_features=5)
exp.show_in_notebook(show_all=False)
Intercept 0.16521671041469516 Prediction_local [0.52623128] Right: 0.30764032838415034
exp.as_list()
[('Capital Gain > 0.00', 0.2087083682084347),
('Marital Status=Married-civ-spouse', 0.14441083188267068),
('Relationship=Husband', 0.07888125296319605),
('Education-Num <= 9.00', -0.03680711651254973),
('Hours per week <= 40.00', -0.03417877162925555)]
Here again the explanations differ between the models. The random forest's local prediction was about 52% for earning more than $50K, while its actual prediction was 31%, so in both cases the local surrogate disagrees with the classifier about the predicted class. Both models identified capital gain as a key factor, with a lower weight in the random forest. Marital status (married) has a similar impact and importance in both models. Education-Num and hours per week have a negative impact on the class ">50K"; they carried somewhat greater importance in xgboost than in the random forest.
Explain an example where the model prediction was incorrect. Comment on the results.
np.where(labels_test != forest.predict(encoder.transform(test)))[0][0]
2
i = 2
exp = explainer_rf.explain_instance(test[i], predict_fn_rf, num_features=5)
exp.show_in_notebook(show_all=False)
Intercept 0.26244050884269826 Prediction_local [0.3402951] Right: 0.400672106113666
It is noticeable that the model predicted "<=50K" (its probability for ">50K" is only about 40%) even though the true label is ">50K". The 'Capital Gain' feature has a positive impact on the class ">50K" and carries the highest weight, but other features, such as marital status (never married), relationship (own child) and a rather young age, pull towards the other class, which likely explains the error.
Analyse one more example with an interesting explanation.
i = 420
exp = explainer_rf.explain_instance(test[i], predict_fn_rf, num_features=5)
exp.show_in_notebook(show_all=False)
Intercept 0.5145549518420021 Prediction_local [0.04898453] Right: 0.03076342627841422
This example is interesting because many features point towards the class "<=50K". The person in this row is 19 years old and has never been married, which suggests that he or she has not yet finished their education. As a result it is unlikely that he or she earns a lot of money. Therefore I am convinced that the presented features and the classification result make sense.